BIOST 561: using the command-line and the
server
Lecture 7
Announcements
- I will NOT be here next week. I will post a
recording of next week’s lecture on Thursday 5/22.
- I will be grading HW3 this weekend. (From what I’ve seen so far,
they look fantastic!)
- HW4 will be released either Monday or Tuesday, but will not be due
until June 2nd (Monday)
- HW4 will ask you for more details about what you’re envisioning for
the final project (i.e., making a Pkgdown), which will be due June 13
(Friday), for me to submit your grades (Credit/No-credit) by June 17
- The final project specifications will also be released alongside
HW4
- Again, the final project is not meant to take a lot
of time
Note 1 about HW3: Using expect_true()
- I personally only use
expect_true() when I write
tests
- (There’s nothing wrong with the other
testthat
functions, I’m just not used to using them and I forget what they
are)
expect_true() expects (in a literal sense) a
single TRUE
set.seed(0)
vec1 <- stats::rpois(5, lambda = 1)
set.seed(0)
vec2 <- stats::rpois(5, lambda = 1)
set.seed(1)
vec3 <- stats::rpois(5, lambda = 1)
vec1
## [1] 2 0 1 1 2
vec2
## [1] 2 0 1 1 2
vec3
## [1] 0 1 1 2 0
testthat::expect_true(vec1 == vec2)
## Error: vec1 == vec2 is not TRUE
##
## `actual`: TRUE TRUE TRUE TRUE TRUE
## `expected`: TRUE
- Uh oh! It’s not happy. In these cases, we want to use
all() (or, when you want to show two things aren’t the
same, any())
testthat::expect_true(all(vec1 == vec2))
testthat::expect_true(any(vec1 != vec3))
Note 2 about HW3: Imports:
vs. Suggests:
- Some of you had questions regarding
Imports: versus
Suggests: in your DESCRIPTION file
- You can look at https://r-pkgs.org/dependencies-mindset-background.html#sec-dependencies-imports-vs-suggests
- Short answer: The difference is how strictly you (the developer of
the package) require the users to have other R packages installed to use
your package
Imports: will ensure the user has all these packages
installed when they install your package
Suggests: is “it’d be nice if a user has these
packages installed, in which case, they’ll be used. But if the user
doesn’t have these packages installed, my package can still be
loaded.”
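For example, the dependency fields of a DESCRIPTION file might look like this (the package names here are just illustrative):

```
Imports:
    stats,
    utils
Suggests:
    testthat,
    knitr
```

Packages under Imports: must be installed for your package to install; packages under Suggests: are optional extras (often only needed for tests or vignettes).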
Note 3 about HW3: An annoying utils warning
Many of you had a warning when you ran devtools::check()
that reads something like this:
Undefined global functions or variables:
combn
Consider adding
importFrom("utils", "combn")
To fix this:
- In your
.R file, instead of writing
combn(clique_nodes, 2) (or something like that), write
utils::combn(clique_nodes, 2). This
declares that you’re using the combn()
function from the utils package.
- Make sure
utils is under Imports: (or
Suggests:) in your DESCRIPTION file.
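As a sketch (clique_nodes here is a hypothetical vector of node indices), the fix looks like:

```r
clique_nodes <- c(1, 4, 7)  # hypothetical vector of node indices

# Before (triggers the check() warning):
#   pairs <- combn(clique_nodes, 2)

# After (explicitly namespaced, so check() knows where combn() comes from):
pairs <- utils::combn(clique_nodes, 2)
pairs  # a 2 x 3 matrix: each column is one pair of nodes
```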
Note 4 about HW3: What did I mean “no correct solution”?
- Many tasks involving cliques for graphs are
notoriously hard.
- Consider the task of answering a “yes/no” question – does a given
graph have a clique of size
k or larger (for some value of
k)?
- This question is what computer scientists call
“NP-complete”
- We are not going to get into this since it’s way
beyond this course, but essentially this means: there is no good way to
solve this problem aside from a brute-force search


- You’re coding a function to find the maximal partial clique (which
is different from the famous “clique decision” problem and also
different from the “maximal clique” problem), but take my word for it:
it’s still extremely hard (although I do not have a proof of
this)
- If you’re fascinated by “how we classify how difficult a
computational problem is”, then you would probably enjoy a CS theory
class.
Debugging
- The reason I didn’t say too much about debugging until now is
because there aren’t very many general principles/tools to help with
debugging.
- Debugging is very person-specific and code-specific
- Nonetheless, I can provide a few guiding principles on how to
conceptualize debugging
- There are two types of “bugs”:
- Your code crashed (i.e., did not complete) or unexpectedly returned
a bunch of
NAs or NaNs.
- Your code gave an output that looks reasonable but it failed your
unit tests.
Two important principles on debugging: Reproducing errors
One: Find how to reproduce your errors
- That is, you want to design a minimal example that yields the same
“bug” consistently
- You can’t fix a problem if you don’t know what exactly the problem
is
- This is easier said than done, especially as you start designing
more and more complicated functions
- If your function relies on random number generation, be sure to use
the
set.seed() function.
- This is why it’s important to:
- Refactor your code by packaging many lines of code into small/simple
functions (i.e., “children” functions).
- Test the small/simple functions first, and work your way “up” to
the main function (i.e., the “parent” function)
- Both aspects together make minimal examples easier to construct
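As a minimal sketch of why set.seed() matters when reproducing a bug:

```r
# without a seed, two draws differ, so a bug that depends on the random
# numbers may not reappear when you re-run your code
x <- stats::rnorm(3)
y <- stats::rnorm(3)
identical(x, y)  # FALSE (almost surely)

# with a seed, the draws (and hence the bug) are exactly reproducible
set.seed(123)
a <- stats::rnorm(3)
set.seed(123)
b <- stats::rnorm(3)
identical(a, b)  # TRUE
```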
Two important principles on debugging: Tracing
Two: Tracing your code to find the specific line
that fails
- This is “detective work”.
- Your code crashing or your unit test failing is “the murder at the
crime scene.” Then, you need to use your thinking and understanding of
code to gather “evidence” that points to the “culprit”.
- Just like in true crime, the specific function that caused the code
to crash might not be the actual culprit. The real murder could’ve
happened long before your code crashed.
median_random_rowSums <- function(mat, trials = 1000){
  p <- ncol(mat)
  rowsum_vec <- sapply(1:trials, function(trial){
    bool_vec <- stats::rbinom(p, size = 1, prob = 0.5)
    idx <- which(bool_vec == 1)
    mat_tmp <- mat[,idx]
    return(rowSums(mat_tmp))
  })
  stats::median(rowsum_vec)
}
mat <- matrix(1:25, nrow = 5, ncol = 5)
median_random_rowSums(mat)
## Error in rowSums(mat_tmp): 'x' must be an array of at least two dimensions
- In this example, even though the “body at the crime scene” is the
rowSums(mat_tmp) function, the real murder happened at
mat_tmp <- mat[,idx]. If idx were just a
vector of length one, mat_tmp would actually be a vector
(and not a matrix), which does not work with the rowSums()
function.
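One way to patch this particular bug (a sketch; the rest of the function is unchanged) is to subset with drop = FALSE, so a single-column subset stays a matrix:

```r
median_random_rowSums <- function(mat, trials = 1000){
  p <- ncol(mat)
  rowsum_vec <- sapply(1:trials, function(trial){
    bool_vec <- stats::rbinom(p, size = 1, prob = 0.5)
    idx <- which(bool_vec == 1)
    mat_tmp <- mat[, idx, drop = FALSE]  # stays a matrix even if length(idx) == 1
    return(rowSums(mat_tmp))
  })
  stats::median(rowsum_vec)
}

set.seed(0)
mat <- matrix(1:25, nrow = 5, ncol = 5)
median_random_rowSums(mat)  # now runs without crashing
```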
Two ways to do tracing
Non-interactive:
- You add
print() statements in strategic places in your
code to figure out the status of your code at different times.
- This helps you figure out: 1) where the code crashed, and 2) what
the status of the code was right before it crashed
- Typically, this strategy is better for when you’re trying to figure
out why your code crashed (rather than why your code doesn’t pass your
unit test despite not crashing).
median_random_rowSums <- function(mat, trials = 1000){
  p <- ncol(mat)
  rowsum_vec <- sapply(1:trials, function(trial){
    print(paste0("Trial: ", trial))
    bool_vec <- stats::rbinom(p, size = 1, prob = 0.5)
    idx <- which(bool_vec == 1)
    print(idx)
    mat_tmp <- mat[,idx]
    vec <- rowSums(mat_tmp)
    print(vec)
    return(vec)
  })
  stats::median(rowsum_vec)
}
set.seed(0) # to reproduce my errors!
mat <- matrix(1:25, nrow = 5, ncol = 5)
median_random_rowSums(mat)
## [1] "Trial: 1"
## [1] 1 4 5
## [1] 38 41 44 47 50
## [1] "Trial: 2"
## [1] 2 3 4 5
## [1] 54 58 62 66 70
## [1] "Trial: 3"
## [1] 4
## Error in rowSums(mat_tmp): 'x' must be an array of at least two dimensions
Interactive:
- You can use the
browser() function or set breakpoints
in RStudio.
- You can see https://adv-r.hadley.nz/debugging.html
for more details
- I personally don’t use this as often since it’s a bit clunky.
However, it is very useful when you’re debugging code
that didn’t crash but still gave the wrong answer. This is because you
can interactively write code to check all the variables inside a
function.
- (We’ll do some demos of this with any remaining time at the end of
class)
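As a small sketch (buggy_mean() is a made-up function for illustration), browser() pauses execution at that line in an interactive session so you can inspect the local variables:

```r
# buggy_mean() is a made-up function for illustration
buggy_mean <- function(x) {
  total <- sum(x)
  browser()  # in an interactive session, execution pauses here: you can
             # print `x` and `total`, run any R code, step with `n`, or quit with `Q`
  total / length(x)
}
buggy_mean(1:4)
```

(In a non-interactive run, browser() just prints a message and execution continues.)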
NOTE ABOUT OPERATING SYSTEMS FOR THIS LECTURE
- The procedures for Windows laptops will be
slightly/dramatically different compared to Mac
laptops
- I (Kevin) only have a Mac, so I might not be as immediately familiar
with how to adjust the procedures for Windows
- I will write (in these slides) my best educated-guess on what the
procedure for Windows is
Opening the terminal (Macs)
- You have
Terminal program on your Mac

- It’ll look like something like this. (Your screen might not look
exactly the same, depending on the style of your
Terminal)

Opening the terminal (Windows)
- You might already have
Windows PowerShell on your
laptop, which is all you need.
Alternatively 1:
Alternatively 2:
Navigating around the terminal
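A minimal sketch of the basic navigation commands (demo_dir is just a hypothetical practice folder):

```shell
pwd                  # print which directory you are currently in
mkdir -p demo_dir    # make a (hypothetical) practice folder
cd demo_dir          # move into it
pwd                  # now one level deeper
cd ..                # move back up one level
ls                   # list the files here; demo_dir should appear
# cd ~               # (jumps to your home directory from anywhere)
```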
Using Git from the terminal
- A list of commands for git
git status: Look at the status of your GitHub
repository (what are files that have changed, has it been staged,
etc.)
git add: Add (i.e., “stage”) a file to be
committed
git commit: Commit all the staged files
git push: Push your repository from your current
location onto GitHub.com
git pull: Pull your repository from GitHub.com to your
current location
- We will do an in-class demo of this now
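The full cycle can be sketched with a throwaway repository (the paths and file names here are just for illustration; push/pull are left as comments since they need a GitHub remote):

```shell
mkdir -p /tmp/git-demo && cd /tmp/git-demo
git init -q                                # create a throwaway repository
git config user.name "Demo"                # an identity is needed to commit
git config user.email "demo@example.com"
echo "hello" > demo.txt
git status --short                         # prints "?? demo.txt" (untracked)
git add demo.txt                           # stage the file
git commit -q -m "Add demo file"           # commit everything staged
git log --oneline                          # shows the commit we just made
# git push origin main                     # would send commits to GitHub
# git pull origin main                     # would fetch changes from GitHub
```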
Moving on: Interacting with the Biostat server
- For large datasets or a lot of computing, it’s somewhat unrealistic
to do all the computation on your local laptop
- Why?
- Your laptop might be “faster” than a typical computer (in terms of a
newer CPU/GPU), but it probably doesn’t have as much memory or storage as
a server
- Once you “close” your laptop, all your jobs are killed. That means
if you run a job that takes 12 hours, you literally cannot close or
put your laptop to sleep for 12 hours.
- Your collaborators can’t see the results (since we don’t put large
datasets on GitHub, and your collaborators also can’t access the files on
your laptop)
Logging into the server
It should look something like this:

- Using the
Terminal (for Macs) or
Windows PowerShell (for Windows):

- You want to type in:
ssh [username]@bayes.biostat.washington.edu

- You’ll be asked for the password. After successfully typing in your
password, you’ll arrive at the landing page for the server.
Congratulations!

Moving files to/from the server
Two ways:
Via GitHub – great for code and figures
Via scp in the Terminal (for Macs) or in
Windows PowerShell (or Command Prompt or
MobaXterm) (for Windows) – great for data and
results
For our course, we will primarily use GitHub to
transfer files to/from your laptop to Bayes.
- Technically, this isn’t ideal for data and results because GitHub
doesn’t like files above 50 Mb
- However, we will do this for the course since 1) nothing you
generate in this course should be a large file, and 2) it will make your
life easier since there’s a lot of new things you’re already
learning
- That said, I will (hopefully) tell you about
scp
sometime later in the course
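For reference, scp usage looks roughly like this ([username] and the file names are placeholders, mirroring the ssh address used for logging in):

```shell
# copy a file FROM your laptop TO the server (run on your laptop):
scp myresults.RData [username]@bayes.biostat.washington.edu:~/

# copy a file FROM the server TO your laptop (also run on your laptop):
scp [username]@bayes.biostat.washington.edu:~/demo_bayes_output.RData .
```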
In-class preparation (Part 1):
- [[I’ll be doing this live. We’ll spend 5-10 minutes to let you
follow along on your laptop.]]
- I put two files on Canvas. Please download them to your local laptop
and put them into your
UWBiost561 R package under the
vignettes folder
- Try this (if you can get it to work):
- On your laptop: Open up
Terminal (on Mac) or
Windows PowerShell (on Windows)
- Navigate to your
UWBiost561 R package via the
cd command
- Type in
git status into the command line. This
should show your Git repository’s status, and it should
also show the 2 new files you’ve just added
- Type in
git add * into the command line. This is a
“lazy way” to simply add/“stage” all your files for the commit. (This is
functionally equivalent to clicking the check-box when you added
files via RStudio)
- Type in
git commit -m "Pushing code for server" into
the command line. (In general, the text inside "..." is your
commit message.)
- Type in
git push origin main. It will ask you for your
username and your GitHub PAT (this is the super-secret
password you had made in HW1. It’s the one that starts with
ghp_…)
- Or (if you couldn’t get the command-line to work)
- You can add & commit & push code to your GitHub via R
Studio, as you’ve already been doing for the HWs
- Now, in your Internet Browser, double check you can see your files
on GitHub.com
- Now, log into Bayes
- This will also involve using the
Terminal or
Windows PowerShell (see a few slides ago)
In-class preparation (Part 2):
- So far so good! Now we’re going to pull your GitHub into your Bayes.
(This is where you’ll probably get lost. Remember, you’re
not looking at the files on your own personal laptop.
You’re now on the departmental server.)
- We’re going to do the simplest thing (which is not
what I usually recommend, since my usual approach involves more steps).
I will tell you what I actually recommend in a future lecture
- On the internet browser, find your GitHub repository’s URL. (For
example, this might be:
https://github.com/UW561/UWBiost561)
- Back in
Terminal or Windows PowerShell,
type cd ~. (This makes sure you are in your home
directory.)
- Then, type in
git clone https://github.com/UW561/UWBiost561 (or whatever
your GitHub is – this will require you to enter your GitHub username and
your GitHub PAT)
- This downloads your entire GitHub repository into your home
directory on Bayes.

- You can look around! Type in
cd UWBiost561/ (you can
tab-complete)
- To see all the files, you can type
ls
- You can look in your vignettes folder, via
cd vignettes/ and then look at the contents of folder via
ls

Submitting a job on the server
Okay, so what exactly did you put into your UWBiost561
package that’s now on Bayes as well?
There are multiple moving parts:
- The
.R script: demo_bayes.R is a simple R
script that computes the eigen-decomposition of a big, random
matrix
- The
.slurm script: demo_bayes.slurm is a
shell script (i.e., code that interacts with an operating system
directly) that tells Bayes how to run
demo_bayes.R
Looking at demo_bayes.R (nothing too special)
# store some useful information
date_of_run <- Sys.time()
session_info <- devtools::session_info()
set.seed(10)
# generate a random matrix
p <- 2000
mat <- matrix(rnorm(p^2), p, p)
mat <- mat + t(mat)
# print out some elements of the matrix
print(mat[1:5,1:5])
# compute eigenvalues
res <- eigen(mat)
# save the results
save(mat, res,
     date_of_run, session_info,
     file = "~/demo_bayes_output.RData")
print("Done! :)")
What on earth is a .slurm script?
- What the heck is SLURM anyway?
- SLURM is a job manager. (We will talk about server etiquette more
next week.) It manages the computing resources for many
many people

- You will run your job on the server “in the background”
- This means even if your connection with the server breaks, your job
will still be running. (This is a good thing! You don’t want to keep
your laptop on for hours and hours.)
- To do this, the server needs to allocate computing resources for
your job
- How much memory does the job need? How much time does the job need?
- Remember, the server is for all users on the system. The SLURM
script helps the server determine how to best allocate resources
(primarily time and computing power)
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- This is what a SLURM script looks like.
- It specifies a bunch of different things
- Realistically, you make a new SLURM script (via copy-paste of a
previous one) and just change a few things. (I personally never
memorized how to write a SLURM script from scratch)
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- Your SLURM script must must must always start with
this line.
- (Don’t ask why. This is one of those things you do without
questioning. If you must know, you can google “hashbang”)
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- This is the most important line in the SLURM script
R CMD BATCH is the command-line function to run a
.R file
- The flags
--no-save --no-restore are optional. (They
make your life just a bit easier, so you might as well keep them
around.)
- The argument is
demo_bayes.R, which is the
.R file you wish to run
- That is, this file (
demo_bayes.slurm) is telling Bayes:
“Hey, I wish to run the script demo_bayes.R”
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- This is the name of your job
- The flag is
--job-name and the value I’m setting it to
is demo
- (More on this later)
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- This is the account you’re using to run your job
- The flag is
--account and the value I’m setting it to
is biostat
- You will never change this when you’re using Bayes
(since you’re a Biostat student running your job on the Biostat server),
so don’t worry about this. Just keep it here and never touch it
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- This is the partition you’re running your job on
- The flag is
--partition and the value I’m setting it to
is students-12c128g
- This is the student partition on Bayes. Only students (you guys) can use
it. I (Kevin) cannot use it.
- (More on this later – you can change to another partition if you
think the current partition you’re using is too busy)
What does a typical .slurm script look like?
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=biostat
#SBATCH --partition=students-12c128g
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=10gb
R CMD BATCH --no-save --no-restore demo_bayes.R
- After the
R CMD BATCH line, this is the next most
important line in the entire SLURM script
- The flags are
--time and --mem-per-cpu,
and the values I’m setting them to are 12:00:00 and
10gb
- This tells Bayes: “Hey Bayes, I think my job will need at
most 12 hours of time and at most 10
gigabytes of memory.”
- This helps Bayes figure out how to allocate resources for your
job
- (More on this next week, when we talk about server etiquette.
Remember, Bayes is for the entire department, not just
you!)
In-class, running your first SLURM script on the server
- [[I’ll be doing this live. We’ll spend 5-10 minutes to let you
follow along on your laptop.]]
- Navigate to be inside your
vignettes/ on Bayes (using
cd)
- Make sure you’re in the right place using
ls. You
should see demo_bayes.R and demo_bayes.slurm
(along all your other HW vignettes)
- In the command line, type in
sbatch demo_bayes.slurm.
You should see Submitted batch job ...
- This script only takes a minute or two to run. While it’s still
running, type in
squeue --me. This will show you which jobs
you are running and their status

- After a minute or two, your job will finish. You’ll notice a bunch
of new files. For example, in your current
vignettes/
directory, you’ll see demo_bayes.Rout (and a
slurm-[job ID].out – we usually don’t need to look at the
latter file).
- You can type
cat demo_bayes.Rout into the command line
to see the contents of this file. It’s a text file that
literally prints out everything that happened in the R session that
ran in the background


- What about the output? Recall we actually saved it outside the
GitHub repository in
demo_bayes.R via
save(..., file = "~/demo_bayes_output.RData"). It’s under
our home directory. Navigate to it via cd ~.

- How do you see the results (in
demo_bayes_output.RData)? Bayes is a server that has R, so
you can just open up R! Type in R.

- Then, you can interact with this just like any other R session. The
only difference is that (compared to R Studio), you
only have the R console. You don’t have any other
panels or drop-down menus. So you’ll need to learn all the R commands to
navigate around (which is why I’ve been typically telling you the
specific R commands to do things, instead of which buttons to click on
in R Studio)
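As a minimal sketch of what load() does (using a temporary file here as a stand-in for ~/demo_bayes_output.RData):

```r
# save two objects, as demo_bayes.R does, then restore them with load()
mat <- matrix(1:4, nrow = 2)
date_of_run <- Sys.time()
out_file <- tempfile(fileext = ".RData")  # stand-in for "~/demo_bayes_output.RData"
save(mat, date_of_run, file = out_file)

rm(mat, date_of_run)  # pretend we just logged into a fresh R session on Bayes
load(out_file)        # restores mat and date_of_run by name
ls()                  # both objects are back in the workspace
```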

Hm… about that interactive R session
Question: Why didn’t we just run our demo_bayes.R script
in this interactive R session (albeit without a fancy GUI like
RStudio)?
Answer: More on this next week when we talk about
server etiquette!!
A few notes about .slurm scripts
- The file suffix of a SLURM script (for example,
demo_bayes.slurm) does not technically
need to be .slurm. It also does not
technically need to be called demo_bayes.
- HOWEVER: I would strongly
recommend you to follow this practice. This helps you remember: 1)
certain files are SLURM scripts, and 2) which
.slurm files
are associated with which .R files
Some practical advice about editing code (that might contradict my
actual advice outside this course)
- I would recommend not editing your
UWBiost561 R package on the server
- (This would require me to tell you about
vim. But more
importantly, it’s very easy to end up with conflicting versions of the
UWBiost561 package if you simultaneously make changes on your
local laptop and on Bayes)
- Instead, I would suggest that (for this course) you:
- Modify your code in R Studio on your laptop
- Commit your changes to your Git repository and push them onto
GitHub
- On the Bayes server, pull your new changes into your
UWBiost561 repository
- (Hence, technically speaking, this is a one-way edit. You only ever
modify your code on your local laptop. While in general, you don’t need
to abide by this practice, it will make your life
easier if you feel overwhelmed by the command line and the server.)
Graphical explanation

Graphical explanation

Graphical explanation

What’s to come in HW4
- As I’ve stated, with the completion of HW3, you all have a working
implementation of
compute_maximal_partial_clique()
- I will be compiling all your versions (anonymized) into a couple of
files that I will upload to Canvas for you to download and put into
the
R folder in your UWBiost561 package
- You will be constructing a simulation study comparing all the
different implementations on Bayes
- HW4 will not technically be harder than HW3, but it will probably
feel overwhelming mainly because there’ll be more moving components
With the remaining time…
- More demos using the server or using Git in the command line
- We will get more practice with the server next time!
- Importantly, we will talk about:
- Server “etiquette” (since it’s a shared resource among all
department members)
- Effectively managing jobs/files
- Making simulation studies
- Git pulls (and merges and branches)
- Looking ahead:
- Lecture 8 will be about the above topics
- Lecture 9 will be about Pkgdown and Github.io (your personal
websites)
- Lecture 10 (last class) is a catch-all. We will discuss a bit about
your simulation results in HW4, a bit about Python, and a bit about
ChatGPT
Additional links
- Companion Google Docs for Linux commands and using Bayes
- Debugging
- Cheatsheet for Git commands
- Cheatsheet for Linux commands
- Cheatsheet for SLURM commands